Learning Multiple-Nonterminal Synchronous Grammars for Statistical Machine Translation
Abstract
Recent work in machine translation has evolved from the traditional word- and phrase-based models to include hierarchical phrase-based and syntax-based models. These advances are motivated by the desire to integrate richer knowledge into the translation process to explicitly address limitations of the purely lexical phrase-based model. Generalized phrases, as discussed in (Chiang, 2005), attempt to directly address the limitations of purely lexical phrases, and have shown significant improvements in translation quality by introducing constructs for sub-phrase representation. However, generalizations are represented by a single sub-phrase category (and a glue rule for serial combination), which provides the ability (and the risk) to insert any available sub-phrase into a larger phrase. The first contribution of this dissertation work is syntax-augmented machine translation (SAMT), an extension to Chiang’s model that provides multiple generalization types based on the phrase-structure parse trees of the training target sentences. We report improvements over strong phrase-based as well as hierarchical phrase-based baselines for French-to-English, Chinese-to-English, and Urdu-to-English. Syntax-based models such as SAMT typically rely on word alignments and parse trees of the training sentence pairs, which are assumed to be correct. In reality, these alignments and parses are not human-generated, but instead result from the most probable configuration of a stochastic model. We provide a method to induce grammars over hidden alignments and parses, approximated from N-best lists. We present results showing improvements for hierarchical phrase-based MT as well as SAMT when using the widened pipeline. The SAMT model presupposes the availability of phrase-structure parse trees for the target training sentences. However, syntactic parsers are only available for a limited set of languages.
We propose a labeling approach that is based merely on part-of-speech analysis of the source or target language (or even both). When using English POS tags in our labeling approach we achieve improvements in translation quality over Chiang’s hierarchical phrase-based MT model. We propose the application of the model to automatically learned word tags as future dissertation work. We further propose to induce a multi-nonterminal grammar from training data without any linguistic annotations based on a generative model over initial rules whose labels are latent variables. Our algorithm’s underlying model assigns class labels to phrase pairs. These labeled phrase pairs are then used as initial rules fed to the rule extraction algorithm of Chiang (2005), resulting in a synchronous gram-
Similar resources
Rule extraction for multi bottom-up tree transducers
Following the invention of computers, it was always a dream to obtain translations automatically. If we give a machine a sentence it should return a sentence in another language expressing the same meaning. In the subfield of statistical machine translation (SMT), this translation is achieved with the help of statistical models. Those models use large text collections to automatically learn bas...
Learning Synchronous Grammars for Semantic Parsing with Lambda Calculus
This paper presents the first empirical results to our knowledge on learning synchronous grammars that generate logical forms. Using statistical machine translation techniques, a semantic parser based on a synchronous context-free grammar augmented with λ-operators is learned given a set of training sentences and their correct logical forms. The resulting parser is shown to be the best-performing...
Preference Grammars: Softening Syntactic Constraints to Improve Statistical Machine Translation
We propose a novel probabilistic synchronous context-free grammar formalism for statistical machine translation, in which syntactic nonterminal labels are represented as “soft” preferences rather than as “hard” matching constraints. This formalism allows us to efficiently score unlabeled synchronous derivations without forgoing traditional syntactic constraints. Using this score as a feature i...
Learning Probabilistic Synchronous CFGs for Phrase-Based Translation
Probabilistic phrase-based synchronous grammars are now considered promising devices for statistical machine translation because they can express reordering phenomena between pairs of languages. Learning these hierarchical, probabilistic devices from parallel corpora constitutes a major challenge, because of multiple latent model variables as well as the risk of data overfitting. This paper pre...
On the String Translations Produced by Multi Bottom-Up Tree Transducers
Many current approaches to syntax-based statistical machine translation fall under the theoretical framework of synchronous tree substitution grammars (STSGs). Tree substitution grammars (TSGs) generalize context-free grammars (CFGs) in that each rule expands a nonterminal to produce an arbitrarily large tree fragment, rather than a fragment of depth one as in a CFG. Synchronous TSGs generate t...